\[~\]

The Final Project

\[~\]

Author \(\rightarrow\) Jeremy Sapienza 1960498
Statistics for Data Science and Laboratory II
Sapienza University of Rome
July 21st, 2021

\[~\]

Guessing Whether There Is a Heart Attack or Not

\[~\]

\[~\]

\[~\]

\[~\]

The Introduction

\[~\]

Nowadays, with the improvement of technology, there is an amount of data that allows us to understand whether or not there is a problem with a person's health.

In this project we explore the data and try to predict whether a patient should be diagnosed with heart disease or not.

\[~\]

The Dataset

\[~\]

\[~\]

Kaggle is the platform where I found this interesting dataset, to which we apply Bayesian inference, the main scope of this project.

This dataset is composed of these features:

## 
## -- Column specification --------------------------------------------------------
## cols(
##   age = col_double(),
##   sex = col_double(),
##   cp = col_double(),
##   trtbps = col_double(),
##   chol = col_double(),
##   fbs = col_double(),
##   restecg = col_double(),
##   thalachh = col_double(),
##   exng = col_double(),
##   oldpeak = col_double(),
##   slp = col_double(),
##   caa = col_double(),
##   thall = col_double(),
##   output = col_double()
## )
## # A tibble: 6 x 14
##     age   sex    cp trtbps  chol   fbs restecg thalachh  exng oldpeak   slp
##   <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl>   <dbl>    <dbl> <dbl>   <dbl> <dbl>
## 1    63     1     3    145   233     1       0      150     0     2.3     0
## 2    37     1     2    130   250     0       1      187     0     3.5     0
## 3    41     0     1    130   204     0       0      172     0     1.4     2
## 4    56     1     1    120   236     0       1      178     0     0.8     2
## 5    57     0     0    120   354     0       1      163     1     0.6     2
## 6    57     1     0    140   192     0       1      148     0     0.4     1
## # ... with 3 more variables: caa <dbl>, thall <dbl>, output <dbl>

\[~\]

Analyzing these features, we captured some interesting information for each feature within the dataset:

  1. age -> it refers to the age of the patient
  2. sex -> it refers to whether the patient is:
    • male (1)
    • female (0)
  3. cp -> it refers to the chest pain type, which can be of four types:
    • 0 is the typical angina
    • 1 is the atypical angina
    • 2 is the non-anginal pain
    • 3 is the asymptomatic type
  4. trtbps -> it refers to the resting blood pressure
  5. chol -> it refers to the serum cholesterol in mg/dl
  6. fbs -> it refers to the fasting blood sugar which, if larger than 120 mg/dl, is represented with:
    • 1 (true)
    • 0 (false)
  7. restecg -> it refers to the resting electrocardiographic results:
    • 0 as normal
    • 1 as having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • 2 as showing probable or definite left ventricular hypertrophy by Estes’ criteria
  8. thalachh -> it refers to the maximum heart rate achieved
  9. exng -> it refers to exercise-induced angina, which can be:
    • 1 (yes)
    • 0 (no)
  10. oldpeak -> it refers to the previous peak (ST depression induced by exercise relative to rest)
  11. slp -> it refers to the slope of the peak exercise ST segment:
    • 0 as up sloping
    • 1 as flat
    • 2 as down sloping
  12. caa -> it refers to the number of major vessels, which goes from 0 to 3
  13. thall -> it refers to the thalium stress test result:
    • 0 as no data
    • 1 as normal
    • 2 as fixed defect
    • 3 as reversible defect
  14. output -> it refers to whether the patient had a heart attack or not

\[~\]

As we can see above, we have qualitative and quantitative data, which will be explained in the next subchapters.

The main features detected have these summaries:

\[~\]

##       age             sex               cp             trtbps     
##  Min.   :29.00   Min.   :0.0000   Min.   :0.0000   Min.   : 94.0  
##  1st Qu.:48.00   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:120.0  
##  Median :55.50   Median :1.0000   Median :1.0000   Median :130.0  
##  Mean   :54.42   Mean   :0.6821   Mean   :0.9636   Mean   :131.6  
##  3rd Qu.:61.00   3rd Qu.:1.0000   3rd Qu.:2.0000   3rd Qu.:140.0  
##  Max.   :77.00   Max.   :1.0000   Max.   :3.0000   Max.   :200.0  
##       chol            fbs           restecg          thalachh    
##  Min.   :126.0   Min.   :0.000   Min.   :0.0000   Min.   : 71.0  
##  1st Qu.:211.0   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:133.2  
##  Median :240.5   Median :0.000   Median :1.0000   Median :152.5  
##  Mean   :246.5   Mean   :0.149   Mean   :0.5265   Mean   :149.6  
##  3rd Qu.:274.8   3rd Qu.:0.000   3rd Qu.:1.0000   3rd Qu.:166.0  
##  Max.   :564.0   Max.   :1.000   Max.   :2.0000   Max.   :202.0  
##       exng           oldpeak           slp             caa        
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.800   Median :1.000   Median :0.0000  
##  Mean   :0.3278   Mean   :1.043   Mean   :1.397   Mean   :0.7185  
##  3rd Qu.:1.0000   3rd Qu.:1.600   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :6.200   Max.   :2.000   Max.   :4.0000  
##      thall           output     
##  Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:0.000  
##  Median :2.000   Median :1.000  
##  Mean   :2.315   Mean   :0.543  
##  3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :3.000   Max.   :1.000

\[~\] As we can see above:

  1. The people captured in this dataset are between 29 and 77 years old, so it is evident that there are no teenagers in this dataset, since there is a low probability that a teenager has a heart attack.
  2. The genders in this dataset are not equally distributed: about 68% of the patients are male.
  3. The output feature is roughly equally distributed between people who had a heart attack and those who did not.

…The behaviour of the other features within the dataset can be seen in this summary, so let’s move on to the data visualization to get a good view of the data we are treating!

\[~\]

The Data Visualization

\[~\]

In this subsection we describe the dataset graphically, to get a first taste of which model may work best in this case. We show some interesting plots below.

\[~\]

The Visualization Using PCA

\[~\]

PCA is a tool (often used for visualization) that allows us to show which patients are similar to each other. Essentially, principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data, sometimes using only the first few principal components and ignoring the rest. PCA is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components, to obtain lower-dimensional data while preserving as much of the data’s variation as possible.

\[~\]

\[~\]

It is normal to note that there could be some loss in this representation due to the number of features that we want to consider, but here we actually recover 89.8% of the variance in the entire dataset using two principal components, so the data is well preserved.
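As an illustration, here is a minimal sketch (not the report’s actual code) of how such a projection can be computed with prcomp(); the data frame `heart` below is a small synthetic stand-in for the Kaggle data.

```r
# Minimal sketch of the PCA step; `heart` is a synthetic stand-in for
# the Kaggle data frame (in the report it is read with readr).
set.seed(1)
heart <- data.frame(age      = rnorm(100, 54, 9),
                    chol     = rnorm(100, 246, 50),
                    thalachh = rnorm(100, 150, 23),
                    output   = rbinom(100, 1, 0.5))

# PCA on the predictors only, centered and scaled
features <- heart[, setdiff(names(heart), "output")]
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Cumulative proportion of variance captured by the first two components
cum2 <- summary(pca)$importance["Cumulative Proportion", 2]

# Coordinates of each patient in the PC1/PC2 plane
scores <- pca$x[, 1:2]
```

Plotting `scores` colored by `output` gives exactly the kind of patient map discussed above.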

\[~\]

The Categorical Data

\[~\]

Here, we want to analyze the percentages of each categorical data:

\[~\]

The most common type is the typical angina (47.4%), which is defined as substernal chest pain precipitated by physical exertion or emotional stress and relieved with rest or nitroglycerin (angina reflects a reduced flow of blood and oxygen to the heart; resting or taking nitroglycerin relaxes the patient and allows the blood to flow well within the patient’s body).

Women and elderly patients usually have atypical symptoms, both at rest and during stress, often in the setting of nonobstructive coronary artery disease (CAD).

\[~\]

The ST/heart rate slope (ST/HR slope) has been proposed as a more accurate ECG criterion for diagnosing significant coronary artery disease (CAD).

The most common slope values lie between the first and second types (flat and down sloping); these are qualitative data that describe the shape of the ST segment during exercise.

\[~\]

The most common case is 0 major vessels involved, which means no vessel damage. In a smaller number of cases a few vessels (around 1-2) are damaged.

\[~\]

The most common value of thall corresponds to a fixed defect.

\[~\]

Angina may feel like pressure in the chest, jaw or arm. It frequently may occur with exercise or stress. Some people with angina also report feeling lightheaded, overly tired, short of breath or nauseated.

As the heart pumps harder to keep up with what you are doing, it needs more oxygen-rich blood. In our data, this reflects the fact that most patients show no exercise-induced angina.

\[~\]

\[~\]

The Quantitative Data

\[~\]

Here, we illustrate the main features of the quantitative data, after having seen the categorical data in the previous section:

\[~\]

hist.and.summary('age', 'Persons Age')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   29.00   48.00   55.50   54.42   61.00   77.00
hist.and.summary('chol', 'Cholestoral')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   126.0   211.0   240.5   246.5   274.8   564.0
hist.and.summary('thalachh', 'Maximum Heart Rate Achieved')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    71.0   133.2   152.5   149.6   166.0   202.0
hist.and.summary('trtbps', 'Resting Blood Pressure')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    94.0   120.0   130.0   131.6   140.0   200.0
hist.and.summary('oldpeak', 'Previous Peak')
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.800   1.043   1.600   6.200

\[~\]

As we can see above, the distributions of these features differ from one another, while some are more similar; for example, if you compare the maximum heart rate and the age of the people, those distributions are clearly different! These considerations could be useful for predicting a heart attack.

After this, we also consider the densities, distinguishing the case of having a heart attack from that of not having it:

\[~\]

dense.chart('age', 'Persons Age')
dense.chart('chol', 'Cholestoral')
dense.chart('thalachh', 'Maximum Heart Rate Achieved')
dense.chart('trtbps', 'Resting Blood Pressure')
dense.chart('oldpeak', 'Previous Peak')

The Correlations

\[~\]

Here, we want to highlight the correlations between the features, treated as qualitative, as quantitative, and without any distinction.

\[~\]

The correlations are measured considering the Pearson formula:

\[ \rho_{XY} = \frac{Cov(X,Y)}{\sigma_X \cdot \sigma_Y} \] Where:

  • \(Cov(X,Y)\) is the covariance between the two sets of values X and Y
  • \(\sigma_X\) is the standard deviation of the set X
  • \(\sigma_Y\) is the standard deviation of the set Y
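As a sanity check, the formula can be reproduced directly in R and compared with the built-in cor(); the two vectors below are illustrative, taken from the first rows of the dataset shown earlier (age and thalachh).

```r
# Pearson correlation computed from the formula Cov(X,Y)/(sd(X)*sd(Y)),
# checked against R's cor(); x and y are illustrative vectors (age and
# thalachh of the first six patients shown above).
x <- c(63, 37, 41, 56, 57, 57)
y <- c(150, 187, 172, 178, 163, 148)
rho <- cov(x, y) / (sd(x) * sd(y))
all.equal(rho, cor(x, y))  # TRUE: the two computations agree
```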

With these correlation maps we see some interesting correlations. It is important to note that the quantitative variables are more meaningful here than the qualitative ones, because correlation is designed for quantitative measures, but we highlight and consider the qualitative variables as well. So, in the complete map we note that:

  • thalachh and age \(\rightarrow\) have a correlation of -0.40
  • cp and exng \(\rightarrow\) have a correlation of -0.39
  • thalachh and exng \(\rightarrow\) have a correlation of -0.38
  • slp and oldpeak \(\rightarrow\) have a correlation of -0.58

It seems that these variables will be important in our predictions; we will see in a moment whether this is the case!

\[~\]

Preliminary Brief Definitions

\[~\]

\[~\]

The Goal (+ Cheating And Using The Frequentist Logistic Regression Approach)

\[~\]

The main goal is to carry out a fully Bayesian analysis aimed at understanding whether a person will have a heart attack or not.

For testing the model in the prediction phase, we decided to split the dataset into a train set (80%) and a test set (20%), in order to be able to see how good our model is. I also decided to scale the values, to get a well-behaved representation in the model and to ease the computations.
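A minimal sketch of this split-and-scale step (the data frame `heart` and its column names are assumptions; a synthetic stand-in is used here so the snippet runs on its own):

```r
# Sketch of the 80/20 train/test split with scaling; `heart` is a
# synthetic stand-in for the real data frame.
set.seed(42)
heart <- data.frame(age    = rnorm(302, 54, 9),
                    chol   = rnorm(302, 246, 50),
                    output = rbinom(302, 1, 0.54))

train.idx <- sample(nrow(heart), size = floor(0.8 * nrow(heart)))
train <- heart[train.idx, ]
test  <- heart[-train.idx, ]

# Scale the predictors using the *training* means/sds only, then apply
# the same transformation to the test set (avoids information leakage)
pred.cols <- setdiff(names(heart), "output")
mu <- sapply(train[pred.cols], mean)
s  <- sapply(train[pred.cols], sd)
train[pred.cols] <- scale(train[pred.cols], center = mu, scale = s)
test[pred.cols]  <- scale(test[pred.cols],  center = mu, scale = s)
```

With 302 rows this gives 241 training observations, which matches the number of observed stochastic nodes reported by jags later on.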

So, the main components of the first model are:

  1. The response variable \(Y_i\) (the “output” feature) that is the binary value that indicates if there is a heart attack or not, so it \(\in \{0,1\}\)

  2. The predictor variables (\(x_i \in \mathbb{R}\)), chosen using the glm() function (we are cheating, but in this way we know beforehand which are the right variables, and at the same time we get to try the frequentist approach), which is used to fit generalized linear models

As we can see below, if we consider all the variables in the logistic regression model, we obtain:

\[~\]

## 
## Call:
## glm(formula = y ~ x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8 + x9 + 
##     x10 + x11 + x12 + x13, family = binomial(link = "logit"), 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5551  -0.4208   0.1495   0.6217   2.3968  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.52933    0.46303   3.303 0.000957 ***
## x1           0.10353    0.23347   0.443 0.657460    
## x2          -1.75631    0.52136  -3.369 0.000755 ***
## x3           0.94949    0.21706   4.374 1.22e-05 ***
## x4          -0.33423    0.19701  -1.697 0.089785 .  
## x5          -0.29725    0.21885  -1.358 0.174382    
## x6          -0.01219    0.57688  -0.021 0.983141    
## x7           0.26634    0.19699   1.352 0.176368    
## x8           0.38210    0.25303   1.510 0.131015    
## x9          -0.73391    0.46097  -1.592 0.111363    
## x10         -0.58441    0.26382  -2.215 0.026750 *  
## x11          0.43430    0.23503   1.848 0.064623 .  
## x12         -0.80081    0.22693  -3.529 0.000417 ***
## x13         -0.46325    0.19466  -2.380 0.017321 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 333.16  on 240  degrees of freedom
## Residual deviance: 176.60  on 227  degrees of freedom
## AIC: 204.6
## 
## Number of Fisher Scoring iterations: 6

\[~\]

As we can see above, the glm() function rejects the hypothesis that we should consider these variables:

… the intercept, in this case, is a good choice to keep in our model.

An important remark is that we tried the frequentist approach on these data, and we should remember that, considering all of the features, we obtained an AIC value of 204.6.

But what happens if we try with only the features that we decided to keep for the next steps?

summary(glm(y ~ x2+x3+x4+x10+x11+x12+x13, family = binomial(link = "logit"),data=dat)) # frequentistic approach
## 
## Call:
## glm(formula = y ~ x2 + x3 + x4 + x10 + x11 + x12 + x13, family = binomial(link = "logit"), 
##     data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6033  -0.5454   0.1627   0.6219   2.5898  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.2029     0.3984   3.020 0.002531 ** 
## x2           -1.5464     0.4557  -3.394 0.000689 ***
## x3            1.1094     0.2031   5.462  4.7e-08 ***
## x4           -0.3561     0.1874  -1.900 0.057444 .  
## x10          -0.6548     0.2496  -2.623 0.008718 ** 
## x11           0.6082     0.2231   2.727 0.006397 ** 
## x12          -0.7987     0.2113  -3.780 0.000157 ***
## x13          -0.5084     0.1820  -2.794 0.005209 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 333.16  on 240  degrees of freedom
## Residual deviance: 187.04  on 233  degrees of freedom
## AIC: 203.04
## 
## Number of Fisher Scoring iterations: 5

The behaviour changes slightly: we obtain an AIC value of 203.04. We will see if we reach the same conclusion with the Bayesian models!

We kept these features after manually checking which ones are the best to consider; in other words, we adopted a form of feature engineering.

Then, we decided to consider another model and to see which model is, in the end, the best one for our analysis.

In summary, we have N = 302 well-distributed observations in our dataset.

\[~\]

The Second Model

\[~\] Now, we want to compare the first model with another one. We again consider a logistic-type regression, but inserting a polynomial term, \(cloglog(p_i) = \beta_{4_i} \cdot x_{4_i}^2 + (\beta_{5_i} \cdot \beta_{6_i}) \cdot (x_{5_i} + x_{6_i}) + \beta_{7_i} \cdot x_{7_i} + \beta_{8_i} \cdot x_{8_i} + \beta_{{13}_i} \cdot x_{{13}_i}\), choosing the features somewhat arbitrarily and using a new link function, in this case the cloglog function.

Why this change? Because we want to see which model is better… and how can you see this? I’ll tell you the way to see which model is better for our data.

What is the link function? The link function transforms the probabilities of the levels of a categorical response variable to a continuous, unbounded scale. Once the transformation is done, the relationship between the predictors and the response can be modeled with, for example, logistic regression.
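In practice, changing the link in a glm() call is a one-word change; a toy sketch (not the report’s dataset):

```r
# Sketch: logit vs cloglog link on toy data; only the link changes,
# the linear predictor stays the same.
set.seed(7)
dat <- data.frame(x2 = rnorm(200), x3 = rnorm(200))
dat$y <- rbinom(200, 1, plogis(1 + dat$x3 - dat$x2))

fit.logit   <- glm(y ~ x2 + x3, family = binomial(link = "logit"),   data = dat)
fit.cloglog <- glm(y ~ x2 + x3, family = binomial(link = "cloglog"), data = dat)

# The two fits can then be compared, e.g. by AIC
c(logit = AIC(fit.logit), cloglog = AIC(fit.cloglog))
```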

As said, we reproduce this second model changing the link function, to see whether it is better than the previous model or not:

\[~\]

## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 241
##    Unobserved stochastic nodes: 6
##    Total graph size: 2694
## 
## Initializing model
## Inference for Bugs model at "C:/Users/Jeremy/AppData/Local/Temp/RtmpsXRjp9/model1da471586646.txt", fit using jags,
##  3 chains, each with 10000 iterations (first 1000 discarded), n.thin = 10
##  n.sims = 2700 iterations saved
##          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## beta13    -0.318   0.071  -0.454  -0.366  -0.320  -0.271  -0.174 1.001  2700
## beta4     -0.129   0.059  -0.253  -0.167  -0.127  -0.089  -0.022 1.001  2700
## beta5    134.976 248.710 -42.256  -0.024   0.096 174.474 773.329 1.523     8
## beta6     -1.367   7.801 -27.914  -0.065   0.000   0.012  10.151 1.082   300
## beta7      0.165   0.091  -0.016   0.108   0.164   0.227   0.334 1.002  1600
## beta8      0.598   0.105   0.398   0.527   0.597   0.664   0.809 1.000  2700
## deviance 278.457   3.165 274.329 276.141 277.780 280.054 286.110 1.001  2200
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 5.0 and DIC = 283.5
## DIC is an estimate of expected predictive error (lower deviance is better).

\[~\]

Can you see something different? It seems so… look at the DIC! The previous model is better (and here the difference is not small at all).

…mmh, what is the DIC?

\[~\]

The Comparison Between These Two Models (DIC)

\[~\]

As we can see above, we considered a different model, changing the link function and using a polynomial term; the DIC (Deviance Information Criterion) is the indicator that tells us which model is better.

The deviance information criterion (DIC) is a hierarchical modeling generalization of the Akaike information criterion (AIC). It is particularly useful in Bayesian model selection problems where the posterior distributions of the models have been obtained by Markov chain Monte Carlo (MCMC) simulation. DIC is an asymptotic approximation as the sample size becomes large, like AIC.

The DIC is calculated as:

\[ DIC = p_D + \overline{D(\theta)} \]

where \(p_D\) is the effective number of parameters and \(\overline{D(\theta)}\) is the posterior mean of the deviance.

The larger the effective number of parameters is, the easier it is for the model to fit the data, and so the deviance needs to be penalized.
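The rule printed by jags above (pD = var(deviance)/2) can be sketched directly from the saved deviance samples; the vector below is a stand-in for the MCMC deviance draws of a fitted model (e.g. `fit$BUGSoutput$sims.list$deviance` in R2jags).

```r
# Sketch of the DIC rule used by jags: pD = var(deviance)/2 and
# DIC = mean(deviance) + pD. `dev.sims` stands in for the MCMC
# deviance samples of a fitted model.
set.seed(3)
dev.sims <- rnorm(2700, mean = 278.5, sd = 3.2)

pD  <- var(dev.sims) / 2           # effective number of parameters
DIC <- mean(dev.sims) + pD         # penalized mean deviance
```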

The lower the DIC value, the better the model; in this case the first model is better.

I decided to include other criteria similar to the DIC as well; we will see in detail what they represent.

\[~\]

Akaike Information Criterion

\[~\]

The Akaike information criterion (AIC) is an estimator of prediction error and thereby relative quality of statistical models for a given set of data. Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

In estimating the amount of information lost by a model, AIC deals with the trade-off between the goodness of fit of the model and the simplicity of the model. In other words, AIC deals with both the risk of overfitting and the risk of underfitting.

The Akaike information criterion is defined as

\[ AIC(m) = D(\hat{\theta_m}, m) + 2 \cdot d_m \]

It is also a penalized deviance measure with penalty equal to 2 for each estimated parameter. \(d_m\) is the number of parameters considered in the model m.
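For a Bernoulli glm the deviance equals \(-2\) times the log-likelihood (the saturated model has likelihood 1), so the formula above can be checked by hand against R’s AIC() on a toy fit (not the report’s model):

```r
# Sketch: AIC by hand, D + 2*d, checked against AIC() on a toy
# Bernoulli glm (for 0/1 responses the deviance is -2*logLik).
set.seed(5)
toy <- data.frame(x = rnorm(100))
toy$y <- rbinom(100, 1, plogis(toy$x))

fit <- glm(y ~ x, family = binomial, data = toy)
d.m <- length(coef(fit))             # number of estimated parameters
aic.by.hand <- fit$deviance + 2 * d.m
all.equal(aic.by.hand, AIC(fit))     # TRUE
```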

In our case comparing the two models we have these values of AIC:

## The AIC for the FIRST model is: 211.261641833565
## The AIC for the SECOND model is: 290.456768891642

\[~\]

Bayesian Information Criterion

\[~\]

The Bayesian information criterion (BIC) is a criterion for model selection among a finite set of models; the model with the lowest BIC is preferred. It is based, in part, on the likelihood function, and it is closely related to the Akaike information criterion (AIC).

When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. BIC introduces a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC.

The Bayesian information criterion is defined as

\[ BIC(m) = D(\hat{\theta_m}, m) + \log(n) \cdot d_m \]

Where:

  • n is the number of observations considered

  • \(d_m\) is the number of coefficients considered.

  • \(D(\hat{\theta_m}, m)\) is the deviance of the model
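The same hand-check works for the BIC, again on a toy Bernoulli glm where the deviance equals \(-2\) times the log-likelihood:

```r
# Sketch: BIC by hand, D + log(n)*d, checked against BIC() on a toy
# Bernoulli glm.
set.seed(6)
toy <- data.frame(x = rnorm(150))
toy$y <- rbinom(150, 1, plogis(toy$x))

fit <- glm(y ~ x, family = binomial, data = toy)
n   <- nrow(toy)
d.m <- length(coef(fit))
bic.by.hand <- fit$deviance + log(n) * d.m
all.equal(bic.by.hand, BIC(fit))     # TRUE
```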

In our case, comparing the two models, we have these values of BIC:

## The BIC for the FIRST model is: 239.14001730149
## The BIC for the SECOND model is: 311.365550492586

\[~\]

Comparing DIC, AIC, BIC

\[~\]

Here we consider all the criteria together:

##       DIC      AIC      BIC
## 1 204.040 211.2616 239.1400
## 2 283.466 290.4568 311.3656

As we can see, within each model the three criteria give fairly similar values, and all of them indicate the first model. They are only used as indicators of how well a model fits our data.

\[~\]

The Conclusions

\[~\] We now plot the logistic regression using the parameters obtained with the first model, comparing the fitted line for each parameter against the response variable:

For clarity of visualization, we prefer to show only the significant ones.

\[~\]

Parameters Recovery Simulations

\[~\] Now we know that the first model is better than the second. So, to check that the proposed model can correctly recover its own parameters, I ran a simulation on data generated from the proposed model; the beta parameters being checked are the estimates from the first model. With these values we can continue towards our last purpose:
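The idea of the recovery check can be sketched as follows, using glm() as a cheap frequentist stand-in for the jags fit (the betas and variable names below are illustrative, not the report’s actual values):

```r
# Sketch of a parameter-recovery check: simulate binary data from a
# logistic model with known betas, refit, and check that the intervals
# cover the true values. glm() stands in for the jags fit.
set.seed(8)
n  <- 302
x2 <- rnorm(n); x3 <- rnorm(n)
beta.true <- c(1.25, -1.62, 1.17)    # intercept, x2, x3 (illustrative)

p <- plogis(beta.true[1] + beta.true[2] * x2 + beta.true[3] * x3)
y <- rbinom(n, 1, p)

fit <- glm(y ~ x2 + x3, family = binomial)
ci  <- confint.default(fit)          # Wald 95% intervals
recovered <- beta.true >= ci[, 1] & beta.true <= ci[, 2]
```

A parameter counts as recovered when its interval covers the value used to simulate the data, which is the same criterion used in the table below.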

## module glm loaded
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 302
##    Unobserved stochastic nodes: 8
##    Total graph size: 3117
## 
## Initializing model
## Inference for Bugs model at "C:/Users/Jeremy/AppData/Local/Temp/RtmpUvKNH7/model588c6e09158b.txt", fit using jags,
##  3 chains, each with 10000 iterations (first 1000 discarded), n.thin = 10
##  n.sims = 2700 iterations saved
##          mu.vect sd.vect    2.5%     25%     50%     75%   97.5%  Rhat n.eff
## beta0      1.428   0.338   0.781   1.196   1.428   1.653   2.089 1.001  2700
## beta10    -0.853   0.185  -1.219  -0.974  -0.852  -0.729  -0.507 1.001  2700
## beta11     0.464   0.156   0.152   0.360   0.461   0.571   0.759 1.001  2700
## beta12    -0.828   0.174  -1.182  -0.940  -0.828  -0.710  -0.502 1.001  2700
## beta13    -0.882   0.171  -1.220  -0.996  -0.874  -0.762  -0.563 1.001  2000
## beta2     -1.867   0.394  -2.616  -2.143  -1.865  -1.600  -1.095 1.001  2700
## beta3      1.248   0.180   0.912   1.126   1.244   1.362   1.609 1.000  2700
## beta4     -0.189   0.146  -0.482  -0.287  -0.186  -0.093   0.097 1.001  2700
## deviance 283.847   3.991 278.102 280.979 283.168 285.995 293.393 1.001  2700
## 
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
## 
## DIC info (using the rule, pD = var(deviance)/2)
## pD = 8.0 and DIC = 291.8
## DIC is an estimate of expected predictive error (lower deviance is better).

\[~\]

The values are roughly the same, so the parameters are recovered reasonably well, as we can see below:

##           beta0 beta10 beta11 beta12 beta13  beta2 beta3  beta4
## Estimated 1.253  -0.69  0.645 -0.851 -0.531 -1.621 1.173 -0.381
## True      1.428 -0.853  0.464 -0.828 -0.882 -1.867 1.248 -0.189
## 2.5%      0.499 -1.197  0.193 -1.291 -0.915 -2.531 0.781 -0.765
## 97.5%      2.09 -0.174  1.091 -0.431 -0.155 -0.749 1.607 -0.016
## Recovered   YES    YES    YES    YES    YES    YES   YES    YES

\[~\]

Further Work

\[~\]

Here, I list different improvements that could be made in the future:

  1. Create different models using different binary classifiers, like SVM, K-Nearest Neighbours, Decision Tree, etc.
  2. Oversample the dataset to see other practical predictions
  3. Produce more interesting plots to describe the data
  4. Consider, after oversampling the dataset, a validation set to better tune the parameters
  5. Use other link functions
  6. Perform better feature engineering on the chosen features
  7. Add other metrics in the prediction part, like F1 Score, Recall, Precision, etc.

\[~\]

Final Conclusions

\[~\]

We saw the power and the weakness of Bayesian Analysis.

As a first step we fitted the model in jags, “cheating” a little bit on which independent variables might be good for our purposes; after making our usual considerations, we compared it with another, less good model. Of course, many little things could be improved in the future in this analysis.

So thanks for all!

\[~\]

The References

\[~\]

  1. Heart Attack Analysis & Prediction Dataset - [Kaggle] https://www.kaggle.com/rashikrahmanpritom/heart-attack-analysis-prediction-dataset

  2. For graphical plots - [Kaggle] https://www.kaggle.com/chaitanya99/heart-attack-analysis-prediction/data

  3. Link functions, the model types - [Alteryx] https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/Selecting-a-Logistic-Regression-Model-Type-Logit-Probit-or/ta-p/111269

  4. HighCharter, for the amazing plots in R - [High Charter] https://jkunst.com/highcharter/articles/hchart.html

  5. Coronary artery disease - [PubMed] https://pubmed.ncbi.nlm.nih.gov/3739881

  6. Nitroglycerin - [Wikipedia] https://en.wikipedia.org/wiki/Nitroglycerin

  7. Vessel Diseases - [Digirad] https://www.digirad.com/triple-vessel-disease-diagnosis-treatment/

  8. Different types of Angina - [CardioSmart] https://www.cardiosmart.org/topics/angina

  9. Logistic Regression Bayesian Model R - [BayesBall] https://bayesball.github.io/BOOK/bayesian-multiple-regression-and-logistic-models.html

  10. Typical Angina - [NCBI] https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5680106

  11. RDocumentation - [RDocumentation] https://www.rdocumentation.org

  12. Spectrum of ECG changes during exercise testing - [SomePoMed] https://somepomed.org/articulos/contents/mobipreview.htm?13/38/13930

\[~\]

\[~\]